import os
from IPython.display import display, HTML
ModuleFolder='C:\\Users\\Gamaliel\\Documents\\G\\ADD\\IBM_DS\\Data_Analysis_Py\\'
os.chdir(ModuleFolder)
Module 1¶
Lesson Summary
- Each line in a dataset is a row, and commas separate the values.
- To understand the data, you must analyze the attributes for each column of data.
- Python libraries are collections of functions and methods that provide ready-made functionality, so you don't have to write code from scratch; they fall into three categories:
- Scientific Computing
- Data Visualization
- Machine Learning Algorithms.
- Many data science libraries are interconnected; for instance, Scikit-learn is built on top of NumPy, SciPy, and Matplotlib.
- The data format and the file path are two key factors for reading data with Pandas.
- The read_csv method in Pandas can read files in CSV format into a Pandas DataFrame.
- Pandas has its own data types, such as object, float64, int64, and datetime64.
- Use the dtypes attribute to check each column's data type; misclassified data types might need manual correction.
- Knowing the correct data types helps apply appropriate Python functions to specific columns.
- Using Statistical Summary with describe() provides count, mean, standard deviation, min, max, and quartile ranges for numerical columns.
- You can also use include='all' as an argument to get summaries for object-type columns.
- The statistical summary helps identify potential issues like outliers needing further attention.
- The info() method gives a concise summary of the DataFrame, including the column names, non-null counts, and data types, which is useful for quick inspection.
- Some statistical metrics may return "NaN," indicating missing values, and the program can’t calculate statistics for that specific data type.
- Python can connect to databases through specialized code, often written in Jupyter notebooks.
- SQL Application Programming Interfaces (APIs) and Python DB APIs (most often used) facilitate the interaction between Python and the DBMS.
- SQL APIs connect to DBMS with one or more API calls, build SQL statements as a text string, and use API calls to send SQL statements to the DBMS and retrieve results and statuses.
- DB-API, Python's standard for interacting with relational databases, uses connection objects to establish and manage database connections and cursor objects to run queries and scroll through the results.
- Connection Object methods include the cursor(), commit(), rollback(), and close() commands.
- You can import the database module, use the Connect API to open a connection, and then create a cursor object to run queries and fetch results.
- Remember to close the database connection to free up resources.
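The connect–cursor–query–close workflow in the bullets above can be sketched with Python's built-in sqlite3 module, which follows the DB-API standard. This is a minimal, self-contained illustration; the table and values are made up, not from the course database:

```python
import sqlite3

# Open a connection (an in-memory database, so the example is self-contained)
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()  # cursor objects run queries and scroll through results

# Build SQL statements as text strings and send them through API calls
cursor.execute("CREATE TABLE laptops (manufacturer TEXT, price REAL)")
cursor.execute("INSERT INTO laptops VALUES (?, ?)", ("Acer", 978.0))
conn.commit()  # persist the changes

cursor.execute("SELECT manufacturer, price FROM laptops")
rows = cursor.fetchall()
print(rows)  # [('Acer', 978.0)]

# Close the cursor and connection to free up resources
cursor.close()
conn.close()
```

Swapping in another DB-API driver (e.g. for Db2 or MySQL) changes mainly the connect() call; the cursor and query pattern stays the same, though parameter placeholder styles can differ between drivers.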
#from pyodide.http import pyfetch
#async def download(url, filename):
#    response = await pyfetch(url)
#    if response.status == 200:
#        with open(filename, "wb") as f:
#            f.write(await response.bytes())
import pandas as pd
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod1.csv"
df = pd.read_csv(file_path, header=0)
df
| | Unnamed: 0 | Manufacturer | Category | Screen | GPU | OS | CPU_core | Screen_Size_cm | CPU_frequency | RAM_GB | Storage_GB_SSD | Weight_kg | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Acer | 4 | IPS Panel | 2 | 1 | 5 | 35.560 | 1.6 | 8 | 256 | 1.60 | 978 |
| 1 | 1 | Dell | 3 | Full HD | 1 | 1 | 3 | 39.624 | 2.0 | 4 | 256 | 2.20 | 634 |
| 2 | 2 | Dell | 3 | Full HD | 1 | 1 | 7 | 39.624 | 2.7 | 8 | 256 | 2.20 | 946 |
| 3 | 3 | Dell | 4 | IPS Panel | 2 | 1 | 5 | 33.782 | 1.6 | 8 | 128 | 1.22 | 1244 |
| 4 | 4 | HP | 4 | Full HD | 2 | 1 | 7 | 39.624 | 1.8 | 8 | 256 | 1.91 | 837 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233 | 233 | Lenovo | 4 | IPS Panel | 2 | 1 | 7 | 35.560 | 2.6 | 8 | 256 | 1.70 | 1891 |
| 234 | 234 | Toshiba | 3 | Full HD | 2 | 1 | 5 | 33.782 | 2.4 | 8 | 256 | 1.20 | 1950 |
| 235 | 235 | Lenovo | 4 | IPS Panel | 2 | 1 | 5 | 30.480 | 2.6 | 8 | 256 | 1.36 | 2236 |
| 236 | 236 | Lenovo | 3 | Full HD | 3 | 1 | 5 | 39.624 | 2.5 | 6 | 256 | 2.40 | 883 |
| 237 | 237 | Toshiba | 3 | Full HD | 2 | 1 | 5 | 35.560 | 2.3 | 8 | 256 | 1.95 | 1499 |
238 rows × 13 columns
Module 3¶
- Tools like the 'describe' function in pandas can quickly calculate key statistical measures like mean, standard deviation, and quartiles for all numerical variables in your data frame.
- Use the 'value_counts' function to summarize data into different categories for categorical data.
- Box plots offer a more visual representation of the data's distribution for numerical data, indicating features like the median, quartiles, and outliers.
- Scatter plots are excellent for exploring relationships between continuous variables, like engine size and price, in a car data set.
- Use Pandas' 'groupby' method to explore relationships between categorical variables.
- Use pivot tables and heat maps for better data visualizations.
- Correlation between variables is a statistical measure that indicates how the changes in one variable might be associated with changes in another variable.
- When exploring correlation, use scatter plots combined with a regression line to visualize relationships between variables.
- Visualization functions like regplot, from the seaborn library, are especially useful for exploring correlation.
- The Pearson correlation, a key method for assessing the correlation between continuous numerical variables, provides two critical values—the coefficient, which indicates the strength and direction of the correlation, and the P-value, which assesses the certainty of the correlation.
- A correlation coefficient close to 1 or -1 indicates a strong positive or negative correlation, respectively, while one close to zero suggests no correlation.
- For P-values, values less than .001 indicate strong certainty in the correlation, while larger values indicate less certainty. Both the coefficient and P-value are important for confirming a strong correlation.
- Heatmaps provide a comprehensive visual summary of the strength and direction of correlations among multiple variables.
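As a minimal sketch of the coefficient/P-value pair described above, one can use scipy.stats.pearsonr. The data here is synthetic and the variable names are illustrative (not from the course dataset); we only assume scipy is installed, as in the lab environment:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic "engine size" and a "price" strongly driven by it, plus noise
engine_size = rng.uniform(60, 330, size=200)
price = 100 * engine_size + rng.normal(0, 3000, size=200)

coef, p_value = stats.pearsonr(engine_size, price)
print(coef)     # close to 1: strong positive correlation
print(p_value)  # far below .001: strong certainty in the correlation
```

A coefficient near 1 with a P-value below .001 is the combination the summary calls a confirmed strong correlation; a coefficient near zero, or a large P-value, would argue against one.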
Module 4¶
Notebook for EDA¶
Import Data from Module 2¶
Setup
Import libraries:
#install specific versions of libraries used in the lab
#! mamba install pandas==1.3.3 -y
#! mamba install numpy==1.21.2 -y
#! mamba install scipy==1.7.1 -y
#! mamba install seaborn==0.9.0 -y
import pandas as pd
import numpy as np
Load the updated dataset by running the cell below.
The cell reads the dataset directly from the URL and stores it in the dataframe df (the browser download helper is commented out):
file_path= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv"
#await download(file_path, "usedcars.csv")
file_name="usedcars.csv"
df = pd.read_csv(file_path, header=0)
Note: This version of the lab runs on JupyterLite, which requires the dataset to be downloaded to the interface. If you are working on the downloaded version of this notebook on your local machine (Jupyter via Anaconda), you can skip the steps above and use the URL directly in the pandas.read_csv() function. You can uncomment and run the statements in the cell below.
#file_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv'
#df = pd.read_csv(file_path, header=0)
View the first 5 rows of the updated dataframe using dataframe.head()
df.head()
| | symboling | normalized-losses | make | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | ... | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | horsepower-binned | diesel | gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 | 11.190476 | Medium | 0 | 1 |
| 1 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 | 11.190476 | Medium | 0 | 1 |
| 2 | 1 | 122 | alfa-romero | std | two | hatchback | rwd | front | 94.5 | 0.822681 | ... | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 | 12.368421 | Medium | 0 | 1 |
| 3 | 2 | 164 | audi | std | four | sedan | fwd | front | 99.8 | 0.848630 | ... | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 | 9.791667 | Medium | 0 | 1 |
| 4 | 2 | 164 | audi | std | four | sedan | 4wd | front | 99.4 | 0.848630 | ... | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 | 13.055556 | Medium | 0 | 1 |
5 rows × 29 columns
Analyzing Individual Feature Patterns Using Visualization¶
To install Seaborn we use pip, the Python package manager.
Import visualization packages "Matplotlib" and "Seaborn". Don't forget about "%matplotlib inline" to plot in a Jupyter notebook.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
How to choose the right visualization method?
When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help you find the right visualization method for that variable.
# list the data types for each column
print(df.dtypes)
symboling              int64
normalized-losses      int64
make                  object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
city-L/100km         float64
horsepower-binned     object
diesel                 int64
gas                    int64
dtype: object
df.info  # note: without parentheses this returns the bound method (shown below); call df.info() to print the summary
<bound method DataFrame.info of symboling normalized-losses make aspiration num-of-doors \
0 3 122 alfa-romero std two
1 3 122 alfa-romero std two
2 1 122 alfa-romero std two
3 2 164 audi std four
4 2 164 audi std four
.. ... ... ... ... ...
196 -1 95 volvo std four
197 -1 95 volvo turbo four
198 -1 95 volvo std four
199 -1 95 volvo turbo four
200 -1 95 volvo turbo four
body-style drive-wheels engine-location wheel-base length ... \
0 convertible rwd front 88.6 0.811148 ...
1 convertible rwd front 88.6 0.811148 ...
2 hatchback rwd front 94.5 0.822681 ...
3 sedan fwd front 99.8 0.848630 ...
4 sedan 4wd front 99.4 0.848630 ...
.. ... ... ... ... ... ...
196 sedan rwd front 109.1 0.907256 ...
197 sedan rwd front 109.1 0.907256 ...
198 sedan rwd front 109.1 0.907256 ...
199 sedan rwd front 109.1 0.907256 ...
200 sedan rwd front 109.1 0.907256 ...
compression-ratio horsepower peak-rpm city-mpg highway-mpg price \
0 9.0 111.0 5000.0 21 27 13495.0
1 9.0 111.0 5000.0 21 27 16500.0
2 9.0 154.0 5000.0 19 26 16500.0
3 10.0 102.0 5500.0 24 30 13950.0
4 8.0 115.0 5500.0 18 22 17450.0
.. ... ... ... ... ... ...
196 9.5 114.0 5400.0 23 28 16845.0
197 8.7 160.0 5300.0 19 25 19045.0
198 8.8 134.0 5500.0 18 23 21485.0
199 23.0 106.0 4800.0 26 27 22470.0
200 9.5 114.0 5400.0 19 25 22625.0
city-L/100km horsepower-binned diesel gas
0 11.190476 Medium 0 1
1 11.190476 Medium 0 1
2 12.368421 Medium 0 1
3 9.791667 Medium 0 1
4 13.055556 Medium 0 1
.. ... ... ... ...
196 10.217391 Medium 0 1
197 12.368421 High 0 1
198 13.055556 Medium 0 1
199 9.038462 Medium 1 0
200 12.368421 Medium 0 1
[201 rows x 29 columns]>
Question #1:
What is the data type of the column "peak-rpm"?
# Write your code below and press Shift+Enter to execute
df['peak-rpm'].dtypes
dtype('float64')
Click here for the solution
```python
df['peak-rpm'].dtypes
```

For example, we can calculate the correlation between variables of type "int64" or "float64" using the method "corr":
# Select only numeric columns for correlation
numeric_df = df.select_dtypes(include=['float64', 'int64'])
corres=numeric_df.corr()
fig, axs = plt.subplots()
mat = axs.pcolor(corres, cmap='coolwarm')
cols = list(numeric_df.columns)
#print(cols)
axs.set_xticks(np.arange(0.5, len(cols) + 0.5, 1))
axs.set_xticklabels(cols, rotation=90)  # rotate labels so they don't overlap
axs.set_yticks(np.arange(0.5, len(cols) + 0.5, 1))
axs.set_yticklabels(cols)
axs.set_aspect('equal', 'box')
axs.set_title('Car properties correlation')
fig.colorbar(mat)
fig.tight_layout()
plt.show()
The diagonal elements are always one; we will study correlation more precisely, using the Pearson correlation, at the end of the notebook.
Question #2:
Find the correlation between the following columns: bore, stroke, compression-ratio, and horsepower.
Hint: if you would like to select those columns, use the following syntax: df[['bore','stroke','compression-ratio','horsepower']]
# Write your code below and press Shift+Enter to execute
df[['bore','stroke','compression-ratio','horsepower']].corr()
| | bore | stroke | compression-ratio | horsepower |
|---|---|---|---|---|
| bore | 1.000000 | -0.055390 | 0.001263 | 0.566936 |
| stroke | -0.055390 | 1.000000 | 0.187923 | 0.098462 |
| compression-ratio | 0.001263 | 0.187923 | 1.000000 | -0.214514 |
| horsepower | 0.566936 | 0.098462 | -0.214514 | 1.000000 |
Click here for the solution
```python
df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr()
```

Continuous Numerical Variables:
Continuous numerical variables are variables that may contain any value within some range. They can be of type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.
In order to start understanding the (linear) relationship between an individual variable and the price, we can use "regplot" which plots the scatterplot plus the fitted regression line for the data. This will be useful later on for visualizing the fit of the simple linear regression model as well.
Let's see several examples of different linear relationships:
Positive Linear Relationship
Let's find the scatterplot of "engine-size" and "price".
# Engine size as potential predictor variable of price
plt.figure(10)
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
plt.show()
As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.
We can examine the correlation between 'engine-size' and 'price' and see that it's approximately 0.87.
df[["engine-size", "price"]].corr()
| | engine-size | price |
|---|---|---|
| engine-size | 1.000000 | 0.872335 |
| price | 0.872335 | 1.000000 |
Negative Linear Relationship

Highway mpg is a potential predictor variable of price. Let's find the scatterplot of "highway-mpg" and "price".
plt.figure(20)
sns.regplot(x="highway-mpg", y="price", data=df)
plt.show()
As highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.
We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704.
df[['highway-mpg', 'price']].corr()
| | highway-mpg | price |
|---|---|---|
| highway-mpg | 1.000000 | -0.704692 |
| price | -0.704692 | 1.000000 |
Weak Linear Relationship
Let's see if "peak-rpm" is a predictor variable of "price".
plt.figure(30)
sns.regplot(x="peak-rpm", y="price", data=df)
plt.show()
Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it's not a reliable variable.
We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616.
df[['peak-rpm','price']].corr()
| | peak-rpm | price |
|---|---|---|
| peak-rpm | 1.000000 | -0.101616 |
| price | -0.101616 | 1.000000 |
Question 3 a):
Find the correlation between x="stroke" and y="price".
Hint: if you would like to select those columns, use the following syntax: df[["stroke","price"]].
# Write your code below and press Shift+Enter to execute
df[['stroke','price']].corr()
| | stroke | price |
|---|---|---|
| stroke | 1.00000 | 0.08231 |
| price | 0.08231 | 1.00000 |
Click here for the solution
```python
#The correlation is 0.0823, the non-diagonal elements of the table.
df[["stroke","price"]].corr()
```

Question 3 b):
Given the correlation results between "price" and "stroke", do you expect a linear relationship?
Verify your results using the function "regplot()".
# Write your code below and press Shift+Enter to execute
Answer='no'
plt.figure(0)
sns.regplot(x=df['stroke'],y=df['price'])
plt.show()
Click here for the solution
```python
#There is a weak correlation between the variables 'stroke' and 'price', so regression will not work well. We can demonstrate this using "regplot".
sns.regplot(x="stroke", y="price", data=df)
```

Categorical Variables
These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.
Let's look at the relationship between "body-style" and "price".
plt.figure(1)
sns.boxplot(x="body-style", y="price", data=df)
plt.show()
We see that the distributions of price between the different body-style categories have a significant overlap, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":
plt.figure(2)
sns.boxplot(x="engine-location", y="price", data=df)
plt.show()
Here we see that the distribution of price between these two engine-location categories, front and rear, are distinct enough to take engine-location as a potential good predictor of price.
Let's examine "drive-wheels" and "price".
# drive-wheels
plt.figure(3)
sns.boxplot(x="drive-wheels", y="price", data=df)
plt.show()
Here we see that the distribution of price between the different drive-wheels categories differs. As such, drive-wheels could potentially be a predictor of price.
Descriptive Statistical Analysis¶
Let's first take a look at the variables by utilizing a description method.
The describe function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.
This will show:
- the count of that variable
- the mean
- the standard deviation (std)
- the minimum value
- the quartiles (25%, 50%, and 75%)
- the maximum value
We can apply the method "describe" as follows:
df.describe()
| | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | diesel | gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 201.000000 | 201.00000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 197.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 |
| mean | 0.840796 | 122.00000 | 98.797015 | 0.837102 | 0.915126 | 53.766667 | 2555.666667 | 126.875622 | 3.330692 | 3.256904 | 10.164279 | 103.405534 | 5117.665368 | 25.179104 | 30.686567 | 13207.129353 | 9.944145 | 0.099502 | 0.900498 |
| std | 1.254802 | 31.99625 | 6.066366 | 0.059213 | 0.029187 | 2.447822 | 517.296727 | 41.546834 | 0.268072 | 0.319256 | 4.004965 | 37.365700 | 478.113805 | 6.423220 | 6.815150 | 7947.066342 | 2.534599 | 0.300083 | 0.300083 |
| min | -2.000000 | 65.00000 | 86.600000 | 0.678039 | 0.837500 | 47.800000 | 1488.000000 | 61.000000 | 2.540000 | 2.070000 | 7.000000 | 48.000000 | 4150.000000 | 13.000000 | 16.000000 | 5118.000000 | 4.795918 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 101.00000 | 94.500000 | 0.801538 | 0.890278 | 52.000000 | 2169.000000 | 98.000000 | 3.150000 | 3.110000 | 8.600000 | 70.000000 | 4800.000000 | 19.000000 | 25.000000 | 7775.000000 | 7.833333 | 0.000000 | 1.000000 |
| 50% | 1.000000 | 122.00000 | 97.000000 | 0.832292 | 0.909722 | 54.100000 | 2414.000000 | 120.000000 | 3.310000 | 3.290000 | 9.000000 | 95.000000 | 5125.369458 | 24.000000 | 30.000000 | 10295.000000 | 9.791667 | 0.000000 | 1.000000 |
| 75% | 2.000000 | 137.00000 | 102.400000 | 0.881788 | 0.925000 | 55.500000 | 2926.000000 | 141.000000 | 3.580000 | 3.410000 | 9.400000 | 116.000000 | 5500.000000 | 30.000000 | 34.000000 | 16500.000000 | 12.368421 | 0.000000 | 1.000000 |
| max | 3.000000 | 256.00000 | 120.900000 | 1.000000 | 1.000000 | 59.800000 | 4066.000000 | 326.000000 | 3.940000 | 4.170000 | 23.000000 | 262.000000 | 6600.000000 | 49.000000 | 54.000000 | 45400.000000 | 18.076923 | 1.000000 | 1.000000 |
The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type 'object' as follows:
df.describe(include=['object'])
| | make | aspiration | num-of-doors | body-style | drive-wheels | engine-location | engine-type | num-of-cylinders | fuel-system | horsepower-binned |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 201 | 201 | 201 | 201 | 201 | 201 | 201 | 201 | 201 | 200 |
| unique | 22 | 2 | 2 | 5 | 3 | 2 | 6 | 7 | 8 | 3 |
| top | toyota | std | four | sedan | fwd | front | ohc | four | mpfi | Low |
| freq | 32 | 165 | 115 | 94 | 118 | 198 | 145 | 157 | 92 | 115 |
Value Counts
Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "drive-wheels". Don’t forget the method "value_counts" only works on pandas series, not pandas dataframes. As a result, we only include one bracket df['drive-wheels'], not two brackets df[['drive-wheels']].
df['drive-wheels'].value_counts()
drive-wheels
fwd    118
rwd     75
4wd      8
Name: count, dtype: int64
We can convert the series to a dataframe as follows:
df['drive-wheels'].value_counts().to_frame()
| | count |
|---|---|
| drive-wheels | |
| fwd | 118 |
| rwd | 75 |
| 4wd | 8 |
Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and rename the column 'drive-wheels' to 'value_counts'.
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.reset_index(inplace=True)
drive_wheels_counts=drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'})
drive_wheels_counts
| | value_counts | count |
|---|---|---|
| 0 | fwd | 118 |
| 1 | rwd | 75 |
| 2 | 4wd | 8 |
Now let's rename the index to 'drive-wheels':
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts
| | value_counts | count |
|---|---|---|
| drive-wheels | ||
| 0 | fwd | 118 |
| 1 | rwd | 75 |
| 2 | 4wd | 8 |
We can repeat the above process for the variable 'engine-location'.
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)
| | count |
|---|---|
| engine-location | |
| front | 198 |
| rear | 3 |
After examining the value counts of the engine location, we see that engine location would not be a good predictor variable for the price. This is because we only have three cars with a rear engine and 198 with an engine in the front, so this result is skewed. Thus, we are not able to draw any conclusions about the engine location.
Basics of Grouping¶
The "groupby" method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups.
For example, let's group by the variable "drive-wheels". We see that there are 3 different categories of drive wheels.
df['drive-wheels'].unique()
array(['rwd', 'fwd', '4wd'], dtype=object)
If we want to know, on average, which type of drive wheel is most valuable, we can group "drive-wheels" and then average them.
We can select the columns 'drive-wheels', 'body-style' and 'price', then assign it to the variable "df_group_one".
df_group_one = df[['drive-wheels','body-style','price']]
We can then calculate the average price for each of the different categories of data.
# grouping results
df_grouped = df_group_one.groupby(['drive-wheels'], as_index=False).agg({'price': 'mean'})
df_grouped
| | drive-wheels | price |
|---|---|---|
| 0 | 4wd | 10241.000000 |
| 1 | fwd | 9244.779661 |
| 2 | rwd | 19757.613333 |
From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel and front-wheel are approximately the same in price.
You can also group by multiple variables. For example, let's group by both 'drive-wheels' and 'body-style'. This groups the dataframe by the unique combination of 'drive-wheels' and 'body-style'. We can store the results in the variable 'grouped_test1'.
# grouping results
df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1
| | drive-wheels | body-style | price |
|---|---|---|---|
| 0 | 4wd | hatchback | 7603.000000 |
| 1 | 4wd | sedan | 12647.333333 |
| 2 | 4wd | wagon | 9095.750000 |
| 3 | fwd | convertible | 11595.000000 |
| 4 | fwd | hardtop | 8249.000000 |
| 5 | fwd | hatchback | 8396.387755 |
| 6 | fwd | sedan | 9811.800000 |
| 7 | fwd | wagon | 9997.333333 |
| 8 | rwd | convertible | 23949.600000 |
| 9 | rwd | hardtop | 24202.714286 |
| 10 | rwd | hatchback | 14337.777778 |
| 11 | rwd | sedan | 21711.833333 |
| 12 | rwd | wagon | 16994.222222 |
This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row. We can convert the dataframe to a pivot table using the method "pivot" to create a pivot table from the groups.
In this case, we will leave the drive-wheels variable as the rows of the table, and pivot body-style to become the columns of the table:
grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
grouped_pivot
| price | |||||
|---|---|---|---|---|---|
| body-style | convertible | hardtop | hatchback | sedan | wagon |
| drive-wheels | |||||
| 4wd | NaN | NaN | 7603.000000 | 12647.333333 | 9095.750000 |
| fwd | 11595.0 | 8249.000000 | 8396.387755 | 9811.800000 | 9997.333333 |
| rwd | 23949.6 | 24202.714286 | 14337.777778 | 21711.833333 | 16994.222222 |
Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.
grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
grouped_pivot
| price | |||||
|---|---|---|---|---|---|
| body-style | convertible | hardtop | hatchback | sedan | wagon |
| drive-wheels | |||||
| 4wd | 0.0 | 0.000000 | 7603.000000 | 12647.333333 | 9095.750000 |
| fwd | 11595.0 | 8249.000000 | 8396.387755 | 9811.800000 | 9997.333333 |
| rwd | 23949.6 | 24202.714286 | 14337.777778 | 21711.833333 | 16994.222222 |
Question 4:
Use the "groupby" function to find the average "price" of each car based on "body-style".
# Write your code below and press Shift+Enter to execute
df_t = df[["body-style","price"]]
df_t1 = df_t.groupby(['body-style'], as_index=False).agg({'price': 'mean'})
df_t1
| | body-style | price |
|---|---|---|
| 0 | convertible | 21890.500000 |
| 1 | hardtop | 22208.500000 |
| 2 | hatchback | 9957.441176 |
| 3 | sedan | 14459.755319 |
| 4 | wagon | 12371.960000 |
Click here for the solution
```python
# grouping results
df_gptest2 = df[['body-style','price']]
grouped_test_bodystyle = df_gptest2.groupby(['body-style'], as_index=False).mean()
grouped_test_bodystyle
```

If you did not import "pyplot", let's do it again.
import matplotlib.pyplot as plt
%matplotlib inline
Variables: Drive Wheels and Body Style vs. Price
Let's use a heat map to visualize the relationship between Body Style vs Price.
#use the grouped results
plt.figure(1)
plt.pcolor(grouped_pivot, cmap='coolwarm')
plt.colorbar()
plt.show()
The heatmap plots the target variable (price) as colour intensity, with 'drive-wheels' and 'body-style' on the vertical and horizontal axes, respectively. This allows us to visualize how the price is related to 'drive-wheels' and 'body-style'.
The default labels convey no useful information to us. Let's change that:
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='coolwarm')
#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index
#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)
#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)
#rotate label if too long
plt.xticks(rotation=45)
fig.colorbar(im)
plt.show()
Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course.
The main question we want to answer in this module is, "What are the main characteristics which have the most impact on the car price?".
To get a better measure of the important characteristics, we look at the correlation of these variables with the car price. In other words: how is the car price dependent on this variable?
Correlation and Causation¶
Correlation: a measure of the extent of interdependence between variables.
Causation: the relationship between cause and effect between two variables.
It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.
Pearson Correlation
The Pearson Correlation measures the linear dependence between two variables X and Y.
The resulting coefficient is a value between -1 and 1 inclusive, where:
- 1: Perfect positive linear correlation.
- 0: No linear correlation, the two variables most likely do not affect each other.
- -1: Perfect negative linear correlation.
Pearson Correlation is the default method of the function "corr". As before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.
df.select_dtypes(include=['number']).corr()
| | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | diesel | gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| symboling | 1.000000 | 0.466264 | -0.535987 | -0.365404 | -0.242423 | -0.550160 | -0.233118 | -0.110581 | -0.140019 | -0.008245 | -0.182196 | 0.075819 | 0.279740 | -0.035527 | 0.036233 | -0.082391 | 0.066171 | -0.196735 | 0.196735 |
| normalized-losses | 0.466264 | 1.000000 | -0.056661 | 0.019424 | 0.086802 | -0.373737 | 0.099404 | 0.112360 | -0.029862 | 0.055563 | -0.114713 | 0.217299 | 0.239543 | -0.225016 | -0.181877 | 0.133999 | 0.238567 | -0.101546 | 0.101546 |
| wheel-base | -0.535987 | -0.056661 | 1.000000 | 0.876024 | 0.814507 | 0.590742 | 0.782097 | 0.572027 | 0.493244 | 0.158502 | 0.250313 | 0.371147 | -0.360305 | -0.470606 | -0.543304 | 0.584642 | 0.476153 | 0.307237 | -0.307237 |
| length | -0.365404 | 0.019424 | 0.876024 | 1.000000 | 0.857170 | 0.492063 | 0.880665 | 0.685025 | 0.608971 | 0.124139 | 0.159733 | 0.579821 | -0.285970 | -0.665192 | -0.698142 | 0.690628 | 0.657373 | 0.211187 | -0.211187 |
| width | -0.242423 | 0.086802 | 0.814507 | 0.857170 | 1.000000 | 0.306002 | 0.866201 | 0.729436 | 0.544885 | 0.188829 | 0.189867 | 0.615077 | -0.245800 | -0.633531 | -0.680635 | 0.751265 | 0.673363 | 0.244356 | -0.244356 |
| height | -0.550160 | -0.373737 | 0.590742 | 0.492063 | 0.306002 | 1.000000 | 0.307581 | 0.074694 | 0.180449 | -0.062704 | 0.259737 | -0.087027 | -0.309974 | -0.049800 | -0.104812 | 0.135486 | 0.003811 | 0.281578 | -0.281578 |
| curb-weight | -0.233118 | 0.099404 | 0.782097 | 0.880665 | 0.866201 | 0.307581 | 1.000000 | 0.849072 | 0.644060 | 0.167562 | 0.156433 | 0.757976 | -0.279361 | -0.749543 | -0.794889 | 0.834415 | 0.785353 | 0.221046 | -0.221046 |
| engine-size | -0.110581 | 0.112360 | 0.572027 | 0.685025 | 0.729436 | 0.074694 | 0.849072 | 1.000000 | 0.572609 | 0.209523 | 0.028889 | 0.822676 | -0.256733 | -0.650546 | -0.679571 | 0.872335 | 0.745059 | 0.070779 | -0.070779 |
| bore | -0.140019 | -0.029862 | 0.493244 | 0.608971 | 0.544885 | 0.180449 | 0.644060 | 0.572609 | 1.000000 | -0.055390 | 0.001263 | 0.566936 | -0.267392 | -0.582027 | -0.591309 | 0.543155 | 0.554610 | 0.054458 | -0.054458 |
| stroke | -0.008245 | 0.055563 | 0.158502 | 0.124139 | 0.188829 | -0.062704 | 0.167562 | 0.209523 | -0.055390 | 1.000000 | 0.187923 | 0.098462 | -0.065713 | -0.034696 | -0.035201 | 0.082310 | 0.037300 | 0.241303 | -0.241303 |
| compression-ratio | -0.182196 | -0.114713 | 0.250313 | 0.159733 | 0.189867 | 0.259737 | 0.156433 | 0.028889 | 0.001263 | 0.187923 | 1.000000 | -0.214514 | -0.435780 | 0.331425 | 0.268465 | 0.071107 | -0.299372 | 0.985231 | -0.985231 |
| horsepower | 0.075819 | 0.217299 | 0.371147 | 0.579821 | 0.615077 | -0.087027 | 0.757976 | 0.822676 | 0.566936 | 0.098462 | -0.214514 | 1.000000 | 0.107885 | -0.822214 | -0.804575 | 0.809575 | 0.889488 | -0.169053 | 0.169053 |
| peak-rpm | 0.279740 | 0.239543 | -0.360305 | -0.285970 | -0.245800 | -0.309974 | -0.279361 | -0.256733 | -0.267392 | -0.065713 | -0.435780 | 0.107885 | 1.000000 | -0.115413 | -0.058598 | -0.101616 | 0.115830 | -0.475812 | 0.475812 |
| city-mpg | -0.035527 | -0.225016 | -0.470606 | -0.665192 | -0.633531 | -0.049800 | -0.749543 | -0.650546 | -0.582027 | -0.034696 | 0.331425 | -0.822214 | -0.115413 | 1.000000 | 0.972044 | -0.686571 | -0.949713 | 0.265676 | -0.265676 |
| highway-mpg | 0.036233 | -0.181877 | -0.543304 | -0.698142 | -0.680635 | -0.104812 | -0.794889 | -0.679571 | -0.591309 | -0.035201 | 0.268465 | -0.804575 | -0.058598 | 0.972044 | 1.000000 | -0.704692 | -0.930028 | 0.198690 | -0.198690 |
| price | -0.082391 | 0.133999 | 0.584642 | 0.690628 | 0.751265 | 0.135486 | 0.834415 | 0.872335 | 0.543155 | 0.082310 | 0.071107 | 0.809575 | -0.101616 | -0.686571 | -0.704692 | 1.000000 | 0.789898 | 0.110326 | -0.110326 |
| city-L/100km | 0.066171 | 0.238567 | 0.476153 | 0.657373 | 0.673363 | 0.003811 | 0.785353 | 0.745059 | 0.554610 | 0.037300 | -0.299372 | 0.889488 | 0.115830 | -0.949713 | -0.930028 | 0.789898 | 1.000000 | -0.241282 | 0.241282 |
| diesel | -0.196735 | -0.101546 | 0.307237 | 0.211187 | 0.244356 | 0.281578 | 0.221046 | 0.070779 | 0.054458 | 0.241303 | 0.985231 | -0.169053 | -0.475812 | 0.265676 | 0.198690 | 0.110326 | -0.241282 | 1.000000 | -1.000000 |
| gas | 0.196735 | 0.101546 | -0.307237 | -0.211187 | -0.244356 | -0.281578 | -0.221046 | -0.070779 | -0.054458 | -0.241303 | -0.985231 | 0.169053 | 0.475812 | -0.265676 | -0.198690 | -0.110326 | 0.241282 | -1.000000 | 1.000000 |
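A matrix this wide is easier to scan visually as a heatmap. A minimal sketch follows; since the lab's `df` isn't reproduced here, it uses a small randomly generated stand-in frame, but the same two lines work on any numeric DataFrame:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for the lab's df; any numeric DataFrame works the same way
rng = np.random.default_rng(0)
df_num = pd.DataFrame(rng.normal(size=(60, 4)),
                      columns=['wheel-base', 'length', 'width', 'price'])

corr = df_num.corr()  # Pearson correlation is the default
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Pearson correlation matrix')
plt.tight_layout()
```

With the real data, passing `df.select_dtypes(include=['number']).corr()` to `sns.heatmap` renders the table above as one color-coded figure.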
Sometimes we would like to know the significance of the correlation estimate.
P-value
What is this P-value? The P-value is the probability of observing a correlation at least this strong if the two variables were actually unrelated, so a small P-value indicates that the correlation is statistically significant. Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant.
By convention:
- p-value $<$ 0.001: strong evidence that the correlation is significant.
- p-value $<$ 0.05: moderate evidence that the correlation is significant.
- p-value $<$ 0.1: weak evidence that the correlation is significant.
- p-value $>$ 0.1: no evidence that the correlation is significant.
We can obtain this information using the "stats" module in the "scipy" library.
from scipy import stats
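The evidence thresholds above are lab conventions, not part of scipy; a small helper (a sketch) makes applying them to `pearsonr` results less error-prone:

```python
def interpret_p_value(p):
    """Map a p-value to the evidence wording used in this lab (thresholds are conventions)."""
    if p < 0.001:
        return 'strong evidence'
    if p < 0.05:
        return 'moderate evidence'
    if p < 0.1:
        return 'weak evidence'
    return 'no evidence'

print(interpret_p_value(8.08e-20))   # strong evidence
print(interpret_p_value(0.2))        # no evidence
```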
Wheel-Base vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
The Pearson Correlation Coefficient is 0.584641822265508 with a P-value of P = 8.076488270732885e-20
Conclusion:
Since the p-value is $<$ 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585).
Horsepower vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'.
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
The Pearson Correlation Coefficient is 0.8095745670036559 with a P-value of P = 6.369057428259557e-48
Conclusion:
Since the p-value is $<$ 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1).
Length vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'length' and 'price'.
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
The Pearson Correlation Coefficient is 0.6906283804483638 with a P-value of P = 8.016477466159723e-30
Conclusion:
Since the p-value is $<$ 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).
Width vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'width' and 'price':
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value )
The Pearson Correlation Coefficient is 0.7512653440522673 with a P-value of P = 9.20033551048206e-38
Conclusion:
Since the p-value is $<$ 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751).
Curb-Weight vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'curb-weight' and 'price':
pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
The Pearson Correlation Coefficient is 0.8344145257702843 with a P-value of P = 2.189577238893965e-53
Conclusion:
Since the p-value is $<$ 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).
Engine-Size vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'engine-size' and 'price':
pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
The Pearson Correlation Coefficient is 0.8723351674455185 with a P-value of P = 9.265491622198793e-64
Conclusion:
Since the p-value is $<$ 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).
Bore vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'bore' and 'price':
pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value )
The Pearson Correlation Coefficient is 0.5431553832626602 with a P-value of P = 8.049189483935315e-17
Conclusion:
Since the p-value is $<$ 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.543).
We can repeat the process for 'city-mpg' and 'highway-mpg':
City-mpg vs. Price
pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
The Pearson Correlation Coefficient is -0.6865710067844678 with a P-value of P = 2.3211320655675098e-29
Conclusion:
Since the p-value is $<$ 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of about -0.687 shows that the relationship is negative and moderately strong.
Highway-mpg vs. Price
pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value )
The Pearson Correlation Coefficient is -0.704692265058953 with a P-value of P = 1.749547114447557e-31
Conclusion:
Since the p-value is $<$ 0.001, the correlation between highway-mpg and price is statistically significant, and the coefficient of about -0.705 shows that the relationship is negative and moderately strong.
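Rather than repeating the same three lines per predictor, the pairs above can be computed in one loop. A sketch with a small synthetic stand-in frame (with the real data, iterate over the actual predictor columns of `df` instead):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in: price strongly driven by horsepower, unrelated to stroke
rng = np.random.default_rng(1)
demo = pd.DataFrame({'horsepower': rng.normal(100, 30, 120),
                     'stroke': rng.normal(3.3, 0.3, 120)})
demo['price'] = 150 * demo['horsepower'] + rng.normal(0, 2000, 120)

results = {}
for col in ['horsepower', 'stroke']:
    coef, p = stats.pearsonr(demo[col], demo['price'])
    results[col] = (coef, p)
    print(f"{col}: r = {coef:.3f}, p = {p:.3g}")
```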
For a single predictor, R-squared equals the square of the Pearson coefficient (0.7047² ≈ 0.4966), which we can confirm by fitting a simple linear regression:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
out=lm.score(X,Y)
print(lm.coef_)
out
[-821.73337832]
0.4965911884339175
Notebook for model development
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Load the data and store it in dataframe df:
#from pyodide.http import pyfetch
#async def download(url, filename):
# response = await pyfetch(url)
# if response.status == 200:
# with open(filename, "wb") as f:
# f.write(await response.bytes())
#file_path= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv"
#await download(file_path, "usedcars.csv")
#file_name="usedcars.csv"
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv"
df = pd.read_csv(file_path, header=0)
#df = pd.read_csv(file_name)
df.head()
| symboling | normalized-losses | make | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | ... | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | horsepower-binned | diesel | gas | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 | 11.190476 | Medium | 0 | 1 |
| 1 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 | 11.190476 | Medium | 0 | 1 |
| 2 | 1 | 122 | alfa-romero | std | two | hatchback | rwd | front | 94.5 | 0.822681 | ... | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 | 12.368421 | Medium | 0 | 1 |
| 3 | 2 | 164 | audi | std | four | sedan | fwd | front | 99.8 | 0.848630 | ... | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 | 9.791667 | Medium | 0 | 1 |
| 4 | 2 | 164 | audi | std | four | sedan | 4wd | front | 99.4 | 0.848630 | ... | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 | 13.055556 | Medium | 0 | 1 |
5 rows × 29 columns
Note: This version of the lab runs on JupyterLite, which requires the dataset to be downloaded to the interface. When working on a downloaded copy of this notebook on a local machine (Jupyter in Anaconda), learners can skip the steps above and use the URL directly in the pandas.read_csv() function. You can uncomment and run the statements in the cell below.
#file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv"
#df = pd.read_csv(file_path, header=0)
1. Linear Regression and Multiple Linear Regression
Linear Regression
One example of a Data Model that we will be using is:
Simple Linear Regression
Simple Linear Regression is a method to help us understand the relationship between two variables:
- The predictor/independent variable (X)
- The response/dependent variable that we want to predict (Y)
The result of Linear Regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable.
$$ Y: Response\ Variable \qquad X: Predictor\ Variable $$
Linear Function:
$$ Yhat = a + bX $$
- a refers to the intercept of the regression line, in other words: the value of Y when X is 0
- b refers to the slope of the regression line, in other words: the value with which Y changes when X increases by 1 unit
Let's load the modules for linear regression:
from sklearn.linear_model import LinearRegression
Create the linear regression object:
lm = LinearRegression()
lm
LinearRegression()
How could "highway-mpg" help us predict car price?
For this example, we want to look at how highway-mpg can help us predict car price. Using simple linear regression, we will create a linear function with "highway-mpg" as the predictor variable and the "price" as the response variable.
print(df.columns)
X = df[['highway-mpg']]
Y = df[['price']]
Index(['symboling', 'normalized-losses', 'make', 'aspiration', 'num-of-doors',
'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length',
'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio',
'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price',
'city-L/100km', 'horsepower-binned', 'diesel', 'gas'],
dtype='object')
Fit the linear model using highway-mpg:
lm.fit(X,Y)
LinearRegression()
We can output a prediction:
Yhat=lm.predict(X)
Yhat[0:5]
array([[16236.50464347],
[16236.50464347],
[17058.23802179],
[13771.3045085 ],
[20345.17153508]])
What is the value of the intercept (a)?
lm.intercept_
array([38423.30585816])
What is the value of the slope (b)?
lm.coef_
array([[-821.73337832]])
What is the final estimated linear model we get?
As we saw above, we should get a final linear model with the structure:
$$ Yhat = a + bX $$
Plugging in the actual values we get:
Price = 38423.31 - 821.73 x highway-mpg
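Plugging a value into this equation by hand reproduces what `lm.predict` returned above. The 30-mpg input below is a hypothetical example; the coefficients are the rounded values just shown:

```python
a = 38423.31          # intercept (lm.intercept_, rounded)
b = -821.73           # slope (lm.coef_, rounded)

highway_mpg = 30      # hypothetical car
predicted_price = a + b * highway_mpg
print(predicted_price)
```

This matches the fourth value of `Yhat` above (~13771.3) up to the rounding of the coefficients.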
Question #1 a):
Create a linear regression object called "lm1".
# Write your code below and press Shift+Enter to execute
lm1=LinearRegression()
lm1
LinearRegression()
Click here for the solution
```python
lm1 = LinearRegression()
lm1
```
Question #1 b):
Train the model using "engine-size" as the independent variable and "price" as the dependent variable.
# Write your code below and press Shift+Enter to execute
cola= ["engine-size"]
colb=["price"]
lm1.fit(df[cola],df[colb])
LinearRegression()
Click here for the solution
```python
lm1.fit(df[['engine-size']], df[['price']])
lm1
```
Question #1 c):
Find the slope and intercept of the model.
Slope
# Write your code below and press Shift+Enter to execute
lm1.coef_
array([[166.86001569]])
Intercept
# Write your code below and press Shift+Enter to execute
lm1.intercept_
array([-7963.33890628])
Click here for the solution
```python
# Slope
lm1.coef_
# Intercept
lm1.intercept_
```
Question #1 d):
What is the equation of the predicted line? You can use x and yhat or "engine-size" and "price".
# Write your code below and press Shift+Enter to execute
print("Eq. is Y = mx + b: Predicted price = {:.3f} * 'engine-size' + ({:.3f})".format(lm1.coef_[0][0], lm1.intercept_[0]))
Eq. is Y = mx + b: Predicted price = 166.860 * 'engine-size' + (-7963.339)
Click here for the solution
```python
# using X and Y
Yhat = -7963.34 + 166.86*X
Price = -7963.34 + 166.86*df['engine-size']
```
Multiple Linear Regression
What if we want to predict car price using more than one variable?
If we want to use more variables in our model to predict car price, we can use Multiple Linear Regression. Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables. Most of the real-world regression models involve multiple predictors. We will illustrate the structure by using four predictor variables, but these results can generalize to any integer:
$$ Y: Response\ Variable\\ X_1: Predictor\ Variable\ 1\\ X_2: Predictor\ Variable\ 2\\ X_3: Predictor\ Variable\ 3\\ X_4: Predictor\ Variable\ 4 $$
$$ a: intercept\\ b_1: coefficient\ of\ Variable\ 1\\ b_2: coefficient\ of\ Variable\ 2\\ b_3: coefficient\ of\ Variable\ 3\\ b_4: coefficient\ of\ Variable\ 4 $$
The equation is given by:
$$ Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 $$
From the previous section, we know that other good predictors of price could be:
- Horsepower
- Curb-weight
- Engine-size
- Highway-mpg
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
Fit the linear model using the four above-mentioned variables.
lm.fit(Z, df['price'])
LinearRegression()
What is the value of the intercept(a)?
lm.intercept_
-15806.624626329198
What are the values of the coefficients (b1, b2, b3, b4)?
lm.coef_
array([53.49574423, 4.70770099, 81.53026382, 36.05748882])
What is the final estimated linear model that we get?
As we saw above, we should get a final linear function with the structure:
$$ Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 $$
What is the linear function we get in this example?
Price = -15806.62 + 53.50 x horsepower + 4.71 x curb-weight + 81.53 x engine-size + 36.06 x highway-mpg
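As with the simple model, a prediction can be reproduced by hand as a dot product of the coefficients with a feature vector. The feature values below are a hypothetical car; the coefficients are the rounded outputs printed above:

```python
import numpy as np

a = -15806.62                                  # lm.intercept_, rounded
b = np.array([53.50, 4.71, 81.53, 36.06])      # lm.coef_, rounded

# Hypothetical car: [horsepower, curb-weight, engine-size, highway-mpg]
x = np.array([111.0, 2548.0, 130.0, 27.0])
yhat = a + b @ x
print(round(yhat, 2))
```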
Question #2 a):
Create and train a Multiple Linear Regression model "lm2" where the response variable is "price" and the predictor variables are "normalized-losses" and "highway-mpg".
# Write your code below and press Shift+Enter to execute
lm2=LinearRegression()
dats=df[["normalized-losses", "highway-mpg"]]
lm2.fit(dats,df['price'])
LinearRegression()
Click here for the solution
```python
lm2 = LinearRegression()
lm2.fit(df[['normalized-losses', 'highway-mpg']], df['price'])
```
Question #2 b):
Find the coefficients of the model.
# Write your code below and press Shift+Enter to execute
lm2.coef_
array([ 1.49789586, -820.45434016])
Click here for the solution
```python
lm2.coef_
```
2. Model Evaluation Using Visualization
Now that we've developed some models, how do we evaluate our models and choose the best one? One way to do this is by using a visualization.
Import the visualization package, seaborn:
# import the visualization package: seaborn
import seaborn as sns
%matplotlib inline
Regression Plot
When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using regression plots.
This plot shows a combination of scattered data points (a scatterplot) and the fitted linear regression line going through the data. This gives us a reasonable estimate of the relationship between the two variables, the strength of the correlation, and the direction (positive or negative correlation).
Let's visualize highway-mpg as a potential predictor variable of price:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
(0.0, 48163.11429508339)
We can see from this plot that price is negatively correlated to highway-mpg since the regression slope is negative. One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data and whether a linear model would be the best fit or not. If the data is too far off from the line, this linear model might not be the best model for this data. Let's compare this plot to the regression plot of "peak-rpm".
plt.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
(0.0, 47414.1)
Comparing the regression plot of "peak-rpm" and "highway-mpg", we see that the points for "highway-mpg" are much closer to the generated line and, on average, decrease. The points for "peak-rpm" have more spread around the predicted line and it is much harder to determine if the points are decreasing or increasing as the "peak-rpm" increases.
Question #3:
Given the regression plots above, is "peak-rpm" or "highway-mpg" more strongly correlated with "price"? Use the method ".corr()" to verify your answer.
# Write your code below and press Shift+Enter to execute
#import piplite
#await piplite.install("jinja2")
corres=df[["peak-rpm","highway-mpg","price"]].corr()
styled_df = corres[corres < 1].style.highlight_min(axis=0, color='yellow')
styled_df
| peak-rpm | highway-mpg | price | |
|---|---|---|---|
| peak-rpm | nan | -0.058598 | -0.101616 |
| highway-mpg | -0.058598 | nan | -0.704692 |
| price | -0.101616 | -0.704692 | nan |
Click here for the solution
```python
# The variable "highway-mpg" has a stronger correlation with "price": approximately -0.704692,
# compared to approximately -0.101616 for "peak-rpm". You can verify it using the following command:
df[["peak-rpm","highway-mpg","price"]].corr()
```
Residual Plot
A good way to visualize the variance of the data is to use a residual plot.
What is a residual?
The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.
So what is a residual plot?
A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.
What do we pay attention to when looking at a residual plot?
We look at the spread of the residuals:
- If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Why is that? Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.
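The residual itself is just `y - yhat`, which can be checked directly. A sketch on synthetic data standing in for highway-mpg and price:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(15, 55, 60).reshape(-1, 1)             # stand-in for highway-mpg
y = 38000 - 820 * X.ravel() + rng.normal(0, 1500, 60)  # stand-in for price

lm_demo = LinearRegression().fit(X, y)
residuals = y - lm_demo.predict(X)   # e = observed - predicted

# For an OLS fit with an intercept, residuals always average to ~0;
# it is their *pattern* across X that the residual plot reveals.
print(residuals.mean())
```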
width = 6
height = 5
plt.figure(figsize=(width, height))
sns.residplot(x=df['highway-mpg'], y=df['price'])
plt.show()
What is this plot telling us?
We can see from this residual plot that the residuals are not randomly spread around the x-axis, leading us to believe that maybe a non-linear model is more appropriate for this data.
Multiple Linear Regression
How do we visualize a model for Multiple Linear Regression? This gets a bit more complicated because you can't visualize it with a regression or residual plot.
One way to look at the fit of the model is by looking at the distribution plot. We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.
First, let's make a prediction:
Y_hat = lm.predict(Z)
plt.figure(figsize=(width, height))
ax1 = sns.kdeplot(df['price'], color="r", label="Actual Value")
sns.kdeplot(Y_hat, color="b", label="Fitted Values" , ax=ax1)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.show()
plt.close()
We can see that the fitted values are reasonably close to the actual values since the two distributions overlap a bit. However, there is definitely some room for improvement.
3. Polynomial Regression and Pipelines
Polynomial regression is a particular case of the general linear regression model or multiple linear regression models.
We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.
There are different orders of polynomial regression:
We saw earlier that a linear model did not provide the best fit while using "highway-mpg" as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.
We will use the following function to plot the data:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')
    plt.show()
    plt.close()
Let's get the variables:
x = df['highway-mpg']
y = df['price']
Let's fit the polynomial using the function polyfit, then use the function poly1d to display the polynomial function.
# Here we use a polynomial of the 3rd order (cubic)
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)
        3         2
-1.557 x + 204.8 x - 8965 x + 1.379e+05
Let's plot the function:
PlotPolly(p, x, y, 'highway-mpg')
np.polyfit(x, y, 3)
array([-1.55663829e+00, 2.04754306e+02, -8.96543312e+03, 1.37923594e+05])
We can already see from plotting that this polynomial model performs better than the linear model. This is because the generated polynomial function "hits" more of the data points.
Question #4:
Create an 11th-order polynomial model with the variables x and y from above.
# Write your code below and press Shift+Enter to execute
f=np.polyfit(x,y,11)
p = np.poly1d(f)
print(p)
PlotPolly(p,x,y, 'Highway MPG')
            11             10            9           8         7
-1.243e-08 x  + 4.722e-06 x  - 0.0008028 x + 0.08056 x - 5.297 x
          6        5             4             3             2
 + 239.5 x - 7588 x + 1.684e+05 x - 2.565e+06 x + 2.551e+07 x - 1.491e+08 x + 3.879e+08
Click here for the solution
```python
# Here we use a polynomial of the 11th order
f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1, x, y, 'Highway MPG')
```
The analytical expression for a multivariate polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by:
$$ Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_1 X_2 + b_4 X_1^2 + b_5 X_2^2 $$
We can perform a polynomial transform on multiple features. First, we import the module:
from sklearn.preprocessing import PolynomialFeatures
We create a PolynomialFeatures object of degree 2:
pr=PolynomialFeatures(degree=2)
pr
PolynomialFeatures()
Z_pr=pr.fit_transform(Z)
In the original data, there are 201 samples and 4 features.
Z.shape
(201, 4)
After the transformation, there are 201 samples and 15 features.
Z_pr.shape
(201, 15)
Pipeline
Data Pipelines simplify the steps of processing the data. We use the module Pipeline to create a pipeline. We also use StandardScaler as a step in our pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
We create the pipeline by creating a list of tuples including the name of the model or estimator and its corresponding constructor.
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]
We input the list as an argument to the pipeline constructor:
pipe=Pipeline(Input)
pipe
Pipeline(steps=[('scale', StandardScaler()),
                ('polynomial', PolynomialFeatures(include_bias=False)),
                ('model', LinearRegression())])
First, we convert the data type Z to type float to avoid conversion warnings that may appear as a result of StandardScaler taking float inputs.
Then, we can normalize the data, perform a transform and fit the model simultaneously.
Z = Z.astype(float)
pipe.fit(Z,y)
Pipeline(steps=[('scale', StandardScaler()),
                ('polynomial', PolynomialFeatures(include_bias=False)),
                ('model', LinearRegression())])
Similarly, we can normalize the data, perform a transform and produce a prediction simultaneously.
ypipe=pipe.predict(Z)
ypipe[0:4]
array([13102.74784201, 13102.74784201, 18225.54572197, 10390.29636555])
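After fitting, each step of the pipeline remains inspectable through `named_steps`. A sketch on synthetic data shaped like `Z` (4 features):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
Z_demo = rng.normal(size=(50, 4))            # stand-in for the 4 predictors
y_demo = Z_demo @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(0, 0.1, 50)

pipe_demo = Pipeline([('scale', StandardScaler()),
                      ('polynomial', PolynomialFeatures(include_bias=False)),
                      ('model', LinearRegression())])
pipe_demo.fit(Z_demo, y_demo)

# The fitted scaler and regressor can be pulled out individually
print(pipe_demo.named_steps['scale'].mean_)        # per-feature means used for scaling
print(pipe_demo.named_steps['model'].coef_.shape)  # 4 linear + 10 quadratic terms
```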
Question #5:
Create a pipeline that standardizes the data, then produce a prediction using a linear regression model using the features Z and target y.
# Write your code below and press Shift+Enter to execute
Input=[('scale',StandardScaler()),('model',LinearRegression())]
Pipe=Pipeline(Input)
Pipe
Z = Z.astype(float)
Pipe.fit(Z,y)
yhat=Pipe.predict(Z)
yhat[0:5]
array([13699.11161184, 13699.11161184, 19051.65470233, 10620.36193015,
15521.31420211])
Click here for the solution
```python
Input = [('scale', StandardScaler()), ('model', LinearRegression())]
pipe = Pipeline(Input)
pipe.fit(Z, y)
ypipe = pipe.predict(Z)
ypipe[0:10]
```
4. Measures for In-Sample Evaluation
When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.
Two very important measures that are often used in Statistics to determine the accuracy of a model are:
- R^2 / R-squared
- Mean Squared Error (MSE)
R squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line.
The value of the R-squared is the percentage of variation of the response variable (y) that is explained by a linear model.
Mean Squared Error (MSE)
The Mean Squared Error measures the average of the squared differences between the actual value (y) and the estimated value (ŷ).
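Both measures can be computed from first principles and checked against the scikit-learn functions. A sketch on a tiny hand-made example (the values are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical observed values
y_pred = np.array([2.5, 5.0, 7.5, 9.0])   # hypothetical predictions

mse = np.mean((y_true - y_pred) ** 2)              # average squared error
ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot                           # fraction of variance explained

print(mse, r2)
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```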
Model 1: Simple Linear Regression
Let's calculate the R^2:
#highway_mpg_fit
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))
The R-square is: 0.4965911884339175
We can say that ~49.659% of the variation in price is explained by this simple linear model "highway_mpg_fit".
Let's calculate the MSE:
We can predict the output, i.e., "yhat", using the predict method, where X is the input variable:
Yhat=lm.predict(X)
print('The output of the first four predicted values is: ', Yhat[0:4])
The output of the first four predicted values is:  [[16236.50464347] [16236.50464347] [17058.23802179] [13771.3045085 ]]
Let's import the function mean_squared_error from the module metrics:
from sklearn.metrics import mean_squared_error
We can compare the predicted results with the actual results:
mse = mean_squared_error(df['price'], Yhat)
print('The mean square error of price and predicted value is: ', mse)
The mean square error of price and predicted value is: 31635042.944639895
Model 2: Multiple Linear Regression
Let's calculate the R^2:
# fit the model
lm.fit(Z, df['price'])
# Find the R^2
print('The R-square is: ', lm.score(Z, df['price']))
The R-square is: 0.8093562806577457
We can say that ~80.94% of the variation of price is explained by this multiple linear regression "multi_fit".
Let's calculate the MSE.
We produce a prediction:
Y_predict_multifit = lm.predict(Z)
We compare the predicted results with the actual results:
print('The mean square error of price and predicted value using multifit is: ', \
mean_squared_error(df['price'], Y_predict_multifit))
The mean square error of price and predicted value using multifit is: 11980366.87072649
Model 3: Polynomial Fit
Let's calculate the R^2.
Let’s import the function r2_score from the module metrics as we are using a different function.
from sklearn.metrics import r2_score
We apply the function to get the value of R^2:
r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)
The R-square value is: 0.702376909204032
We can say that ~70.24% of the variation of price is explained by this polynomial fit.
MSE
We can also calculate the MSE:
mean_squared_error(df['price'], p(x))
18703127.64164033
5. Prediction and Decision Making
Prediction
In the previous section, we trained the model using the method fit. Now we will use the method predict to produce a prediction. Let's import pyplot for plotting; we will also use some functions from NumPy.
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
Create a new input:
new_input=np.arange(1, 100, 1).reshape(-1, 1)
Fit the model:
lm.fit(X, Y)
lm
LinearRegression()
Produce a prediction:
yhat=lm.predict(new_input)
yhat[0:5]
array([[37601.57247984],
[36779.83910151],
[35958.10572319],
[35136.37234487],
[34314.63896655]])
We can plot the data:
plt.plot(new_input, yhat)
plt.show()
Decision Making: Determining a Good Model Fit
Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?
- What is a good R-squared value?
When comparing models, the model with the higher R-squared value is a better fit for the data.
- What is a good MSE?
When comparing models, the model with the smallest MSE value is a better fit for the data.
Let's take a look at the values for the different models.
Simple Linear Regression: Using Highway-mpg as a Predictor Variable of Price.
- R-squared: 0.4965911884339175
- MSE: 3.16 x 10^7
Multiple Linear Regression: Using Horsepower, Curb-weight, Engine-size, and Highway-mpg as Predictor Variables of Price.
- R-squared: 0.8093562806577457
- MSE: 1.2 x 10^7
Polynomial Fit: Using Highway-mpg as a Predictor Variable of Price.
- R-squared: 0.702376909204032
- MSE: 1.87 x 10^7
Simple Linear Regression Model (SLR) vs Multiple Linear Regression Model (MLR)
Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful and even act as noise. As a result, you should always check the MSE and R^2.
In order to compare the results of the MLR vs SLR models, we look at a combination of both the R-squared and MSE to make the best conclusion about the fit of the model.
- MSE: The MSE of SLR is 3.16x10^7 while MLR has an MSE of 1.2 x10^7. The MSE of MLR is much smaller.
- R-squared: In this case, we can also see that there is a big difference between the R-squared of the SLR and the R-squared of the MLR. The R-squared for the SLR (~0.497) is very small compared to the R-squared for the MLR (~0.809).
This R-squared in combination with the MSE show that MLR seems like the better model fit in this case compared to SLR.
Simple Linear Model (SLR) vs. Polynomial Fit
- MSE: We can see that Polynomial Fit brought down the MSE, since this MSE is smaller than the one from the SLR.
- R-squared: The R-squared for the Polynomial Fit is larger than the R-squared for the SLR, so the Polynomial Fit also brought up the R-squared quite a bit.
Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that this was a better fit model than the simple linear regression for predicting "price" with "highway-mpg" as a predictor variable.
Multiple Linear Regression (MLR) vs. Polynomial Fit
- MSE: The MSE for the MLR is smaller than the MSE for the Polynomial Fit.
- R-squared: The R-squared for the MLR is also much larger than for the Polynomial Fit.
Conclusion
Comparing these three models, we conclude that the MLR model is the best model to be able to predict price from our dataset. This result makes sense since we have 27 variables in total and we know that more than one of those variables are potential predictors of the final car price.
Module 4 summary¶
- Linear regression refers to using one independent variable to make a prediction.
- You can use multiple linear regression to explain the relationship between one continuous target y variable and two or more predictor x variables.
- Simple linear regression, or SLR, is a method used to understand the relationship between two variables, the predictor independent variable x and the target dependent variable y.
- Use the regplot and residplot functions in the Seaborn library to create regression and residual plots, which help you identify the strength, direction, and linearity of the relationship between your independent and dependent variables.
- When using residual plots for model evaluation, residuals should ideally have zero mean, appear evenly distributed around the x-axis, and have consistent variance. If these conditions are not met, consider adjusting your model.
- Use distribution plots for models with multiple features: Learn to construct distribution plots to compare predicted and actual values, particularly when your model includes more than one independent variable. Know that this can offer deeper insights into the accuracy of your model across different ranges of values.
- The order of the polynomials affects the fit of the model to your data. Apply Python's polyfit function to develop polynomial regression models that suit your specific dataset.
- To prepare your data for more accurate modeling, use feature transformation techniques, particularly using the preprocessing library in scikit-learn, transform your data using polynomial features, and use the modules like StandardScaler to normalize the data.
- Pipelines allow you to simplify how you perform transformations and predictions sequentially, and you can use pipelines in scikit-learn to streamline your modeling process.
- You can construct and train a pipeline to automate tasks such as normalization, polynomial transformation, and making predictions.
- To determine the fit of your model, you can perform in-sample evaluations by using the Mean Square Error (MSE), using Python’s mean_squared_error function from scikit-learn, and using the score method to obtain the R-squared value.
- A model with a high R-squared value close to 1 and a low MSE is generally a good fit, whereas a model with a low R-squared and a high MSE may not be useful.
- Be alert to situations where your R-squared value might be negative, which means the model performs worse than simply predicting the mean and can indicate overfitting.
- When evaluating models, use visualization and numerical measures and compare different models.
- The mean square error is perhaps the most intuitive numerical measure for determining whether a model is good.
- A distribution plot is a suitable method for multiple linear regression.
- An acceptable r-squared value depends on what you are studying and your use case.
- To evaluate your model’s fit, apply visualization, methods like regression and residual plots, and numerical measures such as the model's coefficients for sensibility:
- Use Mean Square Error (MSE) to measure the average of the squares of the errors between actual and predicted values and examine R-squared to understand the proportion of the variance in the dependent variable that is predictable from the independent variables.
- When analyzing residual plots, residuals should be randomly distributed around zero for a good model. In contrast, a residual plot curve or inaccuracies in certain ranges suggest non-linear behavior or the need for more data.
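The normalization, polynomial transformation, and prediction steps summarized above can be chained into one pipeline. A minimal sketch on synthetic data (names and values here are illustrative, not the lab's):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic two-feature data with a quadratic relationship
rng = np.random.default_rng(1)
Z = rng.uniform(0, 5, size=(100, 2))
y = Z[:, 0] ** 2 + 3 * Z[:, 1] + rng.normal(0, 0.1, size=100)

# Steps run in order on both fit and predict:
# scale -> expand to polynomial terms -> fit/predict with the regression
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
    ('model', LinearRegression()),
])
pipe.fit(Z, y)
print(pipe.score(Z, y))  # in-sample R^2
```

Because the transformers are inside the pipeline, calling `pipe.predict` on new data automatically applies the same scaling and polynomial expansion that were fitted on the training data.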
Module 5¶
Introduction to Ridge Regression
For models with multiple independent features, and for models that add polynomial features, it is common to have collinear combinations of features. Left unchecked, this multicollinearity can lead the model to overfit the training data. To control this, the feature sets are typically regularized using hyperparameters.
Ridge regression regularizes the feature set using the hyperparameter alpha. The upcoming video shows how Ridge regression can be used to reduce standard errors and avoid overfitting while using a regression model.
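To see the effect of the alpha penalty, here is a small sketch on synthetic data (not from this lab) with two nearly collinear features, comparing the coefficient magnitudes of ordinary least squares and Ridge:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly collinear features; only the first actually drives y
rng = np.random.default_rng(2)
x1 = rng.normal(size=80)
x2 = x1 + rng.normal(scale=0.001, size=80)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=80)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# With collinear features the OLS coefficients can grow very large in
# opposite directions; the alpha penalty shrinks their overall magnitude.
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```

Larger alpha means stronger shrinkage of the coefficients toward zero; alpha=0 recovers ordinary least squares.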
NB for practice Project: Insurance Cost Analysis¶
Estimated time needed: 75 minutes
In this project, you will perform analytics operations on an insurance database that uses the parameters described below.
| Parameter | Description | Content type |
|---|---|---|
| age | Age in years | integer |
| gender | Male or Female | integer (1 or 2) |
| bmi | Body mass index | float |
| no_of_children | Number of children | integer |
| smoker | Whether smoker or not | integer (0 or 1) |
| region | Which US region - NW, NE, SW, SE | integer (1,2,3 or 4 respectively) |
| charges | Annual Insurance charges in USD | float |
Objectives¶
In this project, you will:
- Load the data as a pandas dataframe
- Clean the data, taking care of the blank entries
- Run exploratory data analysis (EDA) and identify the attributes that most affect the charges
- Develop single-variable and multi-variable Linear Regression models for predicting the charges
- Use Ridge regression to refine the performance of Linear Regression models
Setup¶
- Define a helper function to plot a comparison between actual and predicted distributions
def plotdists(D1,D2,L1,L2,Lg):
    # Overlay kernel density estimates of the two distributions
    sns.kdeplot(D1,color='b',label=L1)
    sns.kdeplot(D2,color='r',label=L2)
    plt.legend(Lg)
Importing Required Libraries¶
We recommend you import all required libraries in one place (here):
import warnings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
warnings.filterwarnings('ignore')
Click here for Solution
```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
```

Download the dataset to this lab environment¶
Task 1 : Import the dataset¶
Import the dataset into a pandas dataframe. Note that there are currently no headers in the CSV file.
Print the first 10 rows of the dataframe to confirm successful loading.
import pandas as pd
file_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/medical_insurance_dataset.csv'
df = pd.read_csv(file_path, header=None)
df.head(10)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 1 | 27.900 | 0 | 1 | 3 | 16884.92400 |
| 1 | 18 | 2 | 33.770 | 1 | 0 | 4 | 1725.55230 |
| 2 | 28 | 2 | 33.000 | 3 | 0 | 4 | 4449.46200 |
| 3 | 33 | 2 | 22.705 | 0 | 0 | 1 | 21984.47061 |
| 4 | 32 | 2 | 28.880 | 0 | 0 | 1 | 3866.85520 |
| 5 | 31 | 1 | 25.740 | 0 | ? | 4 | 3756.62160 |
| 6 | 46 | 1 | 33.440 | 1 | 0 | 4 | 8240.58960 |
| 7 | 37 | 1 | 27.740 | 3 | 0 | 1 | 7281.50560 |
| 8 | 37 | 2 | 29.830 | 2 | 0 | 2 | 6406.41070 |
| 9 | 60 | 1 | 25.840 | 0 | 0 | 1 | 28923.13692 |
Click here for Solution
```python
df = pd.read_csv(file_name, header=None)
print(df.head(10))
```

Add the headers to the dataframe, as mentioned in the project scenario.
cols=['age','gender','bmi','no_of_children','smoker','region','charges']
df.columns=cols
df.head()
| age | gender | bmi | no_of_children | smoker | region | charges | |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 1 | 27.900 | 0 | 1 | 3 | 16884.92400 |
| 1 | 18 | 2 | 33.770 | 1 | 0 | 4 | 1725.55230 |
| 2 | 28 | 2 | 33.000 | 3 | 0 | 4 | 4449.46200 |
| 3 | 33 | 2 | 22.705 | 0 | 0 | 1 | 21984.47061 |
| 4 | 32 | 2 | 28.880 | 0 | 0 | 1 | 3866.85520 |
Click here for Solution
```python
headers = ["age", "gender", "bmi", "no_of_children", "smoker", "region", "charges"]
df.columns = headers
```

Now, replace the '?' entries with 'NaN' values.
df.replace('?',np.nan,inplace=True)
Click here for Solution
```python
df.replace('?', np.nan, inplace = True)
```

Task 2 : Data Wrangling¶
Use dataframe.info() to identify the columns that have some 'Null' (or NaN) information.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2772 entries, 0 to 2771
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             2768 non-null   object
 1   gender          2772 non-null   int64
 2   bmi             2772 non-null   float64
 3   no_of_children  2772 non-null   int64
 4   smoker          2765 non-null   object
 5   region          2772 non-null   int64
 6   charges         2772 non-null   float64
dtypes: float64(2), int64(3), object(2)
memory usage: 151.7+ KB
Click here for Solution
```python
print(df.info())
```

Handle missing data:
- For continuous attributes (e.g., age), replace missing values with the mean.
- For categorical attributes (e.g., smoker), replace missing values with the most frequent value.
- Update the data types of the respective columns.
- Verify the update using df.info().
is_smoker = df['smoker'].value_counts().idxmax()
df["smoker"].replace(np.nan, is_smoker,inplace=True)
df['smoker']=df['smoker'].astype(int)
mean_age = df['age'].astype('float').mean()  # mean of the values, not of the value counts
df["age"].replace(np.nan, mean_age,inplace=True)
df['age']=df['age'].astype(int)
df.info();
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2772 entries, 0 to 2771
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             2772 non-null   int32
 1   gender          2772 non-null   int64
 2   bmi             2772 non-null   float64
 3   no_of_children  2772 non-null   int64
 4   smoker          2772 non-null   int32
 5   region          2772 non-null   int64
 6   charges         2772 non-null   float64
dtypes: float64(2), int32(2), int64(3)
memory usage: 130.1 KB
Click here for Solution
```python
# smoker is a categorical attribute, replace with most frequent entry
is_smoker = df['smoker'].value_counts().idxmax()
df["smoker"].replace(np.nan, is_smoker, inplace=True)

# age is a continuous variable, replace with mean age
mean_age = df['age'].astype('float').mean(axis=0)
df["age"].replace(np.nan, mean_age, inplace=True)

# Update data types
df[["age","smoker"]] = df[["age","smoker"]].astype("int")

print(df.info())
```

Also note that the charges column has values that are more than 2 decimal places long. Update the charges column so that all values are rounded to 2 decimal places. Verify the conversion by printing the first 5 rows of the updated dataframe.
df.head()
df['charges']=round(df['charges'],2)
df.head()
| age | gender | bmi | no_of_children | smoker | region | charges | |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 1 | 27.900 | 0 | 1 | 3 | 16884.92 |
| 1 | 18 | 2 | 33.770 | 1 | 0 | 4 | 1725.55 |
| 2 | 28 | 2 | 33.000 | 3 | 0 | 4 | 4449.46 |
| 3 | 33 | 2 | 22.705 | 0 | 0 | 1 | 21984.47 |
| 4 | 32 | 2 | 28.880 | 0 | 0 | 1 | 3866.86 |
Click here for Solution
```python
df[["charges"]] = np.round(df[["charges"]],2)
print(df.head())
```

Task 3 : Exploratory Data Analysis (EDA)¶
Implement the regression plot for charges with respect to bmi.
f,xs=plt.subplots(2,1)
sns.regplot(x=df['bmi'],y=df['charges'],data=df,ax=xs[0]);
sns.regplot(x=df['age'],y=df['charges'],data=df,ax=xs[1]);
plt.subplots_adjust(hspace=0.4)
Click here for Solution
```python
sns.regplot(x="bmi", y="charges", data=df, line_kws={"color": "red"})
plt.ylim(0,)
```

Implement the box plot for charges with respect to smoker.
sns.boxplot(data=df,x=df['smoker'],y=df['charges'])
<Axes: xlabel='smoker', ylabel='charges'>
Click here for Solution
```python
sns.boxplot(x="smoker", y="charges", data=df)
```

Print the correlation matrix for the dataset.
fig,ax = plt.subplots()
mt=ax.pcolor(df.corr())
ax.set_xticks(np.arange(len(list(df.columns)))+.5,minor=False)
ax.set_xticklabels(df.columns)
plt.xticks(rotation=45);
ax.set_yticks(np.arange(len(list(df.columns)))+.5,minor=False)
ax.set_yticklabels(df.columns);
ax.set_aspect('equal', 'box')
fig.colorbar(mt);
Click here for Solution
```python
print(df.corr())
```

Task 4 : Model Development¶
Fit a linear regression model that may be used to predict the charges value, just by using the smoker attribute of the dataset. Print the $ R^2 $ score of this model.
lr=LinearRegression()
lr
LinearRegression()
Click here for Solution
```python
X = df[['smoker']]
Y = df['charges']
lm = LinearRegression()
lm.fit(X,Y)
print(lm.score(X, Y))
```

Fit a linear regression model that may be used to predict the charges value, just by using all other attributes of the dataset. Print the $ R^2 $ score of this model. You should see an improvement in the performance.
from sklearn.metrics import r2_score
xdt=df.drop('charges',axis=1)
ydt=df['charges']
lr.fit(xdt,ydt)
lr.score(xdt,ydt)
yhat=lr.predict(xdt)
scr=r2_score(ydt,yhat)  # order is r2_score(y_true, y_pred)
plotdists(ydt,yhat,'Original','Predicted',[f"R\N{superscript two}={round(scr,4)}"])
Click here for Solution
```python
# definition of Y and lm remain same as used in last cell.
Z = df[["age", "gender", "bmi", "no_of_children", "smoker", "region"]]
lm.fit(Z,Y)
print(lm.score(Z, Y))
```

Create a training pipeline that uses StandardScaler(), PolynomialFeatures() and LinearRegression() to create a model that can predict the charges value using all the other attributes of the dataset. There should be even further improvement in the performance.
# The pipeline applies the polynomial expansion itself, so the features
# must not be pre-transformed with PolynomialFeatures first
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]
pipe=Pipeline(Input)
pipe.fit(xdt,ydt)
ypipt=pipe.predict(xdt)
plotdists(ydt,ypipt,'Original','Predicted',[f"R\N{superscript two}={round(pipe.score(xdt,ydt),4)}"])
Click here for Solution
```python
# Y and Z use the same values as defined in previous cells
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())]
pipe=Pipeline(Input)
Z = Z.astype(float)
pipe.fit(Z,Y)
ypipe=pipe.predict(Z)
print(r2_score(Y,ypipe))
```

Task 5 : Model Refinement¶
Split the data into training and testing subsets, assuming that 20% of the data will be reserved for testing.
xtr,xts,ytr,yts=train_test_split(xdt,ydt,test_size=0.2,random_state=0)
Click here for Solution
```python
# Z and Y hold same values as in previous cells
x_train, x_test, y_train, y_test = train_test_split(Z, Y, test_size=0.2, random_state=1)
```

Initialize a Ridge regressor that uses the hyperparameter $ \alpha = 0.1 $. Fit the model using the training data subset. Print the $ R^2 $ score for the testing data.
from sklearn.metrics import r2_score
RidgeModel=Ridge(alpha=0.1)
RidgeModel.fit(xtr,ytr)
test_score= RidgeModel.score(xts,yts)
print('Ridge score= ', test_score)
yRidge= RidgeModel.predict(xts)
print('Sklearn score= ', r2_score(yts,yRidge))  # order is r2_score(y_true, y_pred)
plotdists(yts,yRidge,'Original','Predicted',[f"R\N{superscript two}={round(r2_score(yts,yRidge),4)}"])
Ridge score=  0.7452378156489365 Sklearn score=  0.7452378156489365
Click here for Solution
```python
# x_train, x_test, y_train, y_test hold same values as in previous cells
RidgeModel=Ridge(alpha=0.1)
RidgeModel.fit(x_train, y_train)
yhat = RidgeModel.predict(x_test)
print(r2_score(y_test,yhat))
```

Apply polynomial transformation to the training parameters with degree=2. Use this transformed feature set to fit the same regression model, as above, using the training subset. Print the $ R^2 $ score for the testing subset.
pl2=PolynomialFeatures(degree=2)
data_tr_poly2=pl2.fit_transform(xtr)
data_ts_poly2=pl2.transform(xts)  # transform (not fit_transform) the test split with the fitted transformer
RidgeModel=Ridge(alpha=0.1)
RidgeModel.fit(data_tr_poly2,ytr)
predicted=RidgeModel.predict(data_ts_poly2)
plotdists(yts,predicted,'Original','Predicted',[f"R\N{superscript two}={round(r2_score(yts,predicted),4)}"])
Lesson Summary¶
How to split your data using the train_test_split() method into training and test sets. You use the training set to train a model, discover possible predictive relationships, and then use the test set to test your model to evaluate its performance.
How to use the generalization error to measure how well your data does at predicting previously unseen data.
How to use cross-validation by splitting the data into folds, using some of the folds as a training set to train the model and the remaining fold as a test set to evaluate it. You iterate until each partition has been used for both training and testing, then average the results as an estimate of the out-of-sample error.
How to pick the best polynomial order and problems that arise when selecting the wrong order polynomial by analyzing models that underfit and overfit your data.
Select the best order of a polynomial to fit your data by minimizing the test error using a graph comparing the mean square error to the order of the fitted polynomials.
You should use ridge regression when there is a strong relationship among the independent variables.
Ridge regression helps prevent overfitting.
Ridge regression controls the magnitude of polynomial coefficients by introducing a hyperparameter, alpha.
To determine alpha, you divide your data into training and validation data. Starting with a small value of alpha, you train the model, make a prediction using the validation data, then calculate the R-squared and store the value. You repeat the process for progressively larger values of alpha and select the value of alpha that maximizes R-squared.
Grid search allows you to scan through multiple hyperparameters using the scikit-learn library, which iterates over these parameters using cross-validation. Based on the results of the grid search, you select the optimum hyperparameter values.
The GridSearchCV() method takes in a dictionary as its argument where the key is the name of the hyperparameter, and the values are the hyperparameter values you wish to iterate over.
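The alpha search described above can be sketched with GridSearchCV. The data below is synthetic and the candidate alpha values are illustrative, not prescribed by the course:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=120)

# GridSearchCV takes a dictionary: key = hyperparameter name,
# values = the candidate values to iterate over with cross-validation
parameters = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(Ridge(), parameters, cv=4)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Equivalent manual check of the chosen alpha with cross-validation
scores = cross_val_score(Ridge(alpha=grid.best_params_['alpha']), X, y, cv=4)
print(scores.mean())
```

`best_score_` is the mean cross-validated R² of the best parameter combination, which is why it matches the manual `cross_val_score` check.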
try:
!jupyter nbconvert Data_Analysis_Notes.ipynb --to html --template pj
except Exception as e:
print('HTML not stored')
import shutil
import os
FromFld='C:\\Users\\Gamaliel\\Documents\\G\\ADD\\IBM_DS\\Data_Analysis_Py\\'
Tofld='C:\\Users\\Gamaliel\\Documents\\G\\ADD\\IBM_DS\\IBM_DS_Jupyter_Tasks\\Python4DataScience\\'
HTML_Notes='Data_Analysis_Notes.html'
Jupyter_Notes='Data_Analysis_Notes.ipynb'
try:
if os.path.isfile(Tofld+'/'+HTML_Notes):
os.remove(Tofld+'/'+HTML_Notes)
print(HTML_Notes, 'deleted in', Tofld)
shutil.move(os.path.join(FromFld,HTML_Notes),os.path.join(Tofld,HTML_Notes))
print(HTML_Notes, 'replaced in', Tofld)
else:
shutil.move(os.path.join(FromFld,HTML_Notes),os.path.join(Tofld,HTML_Notes))
print(HTML_Notes, 'written in', Tofld)
except Exception as e:
print('HTML not moved')
# NB
try:
if os.path.isfile(Tofld+'/'+Jupyter_Notes):
os.remove(Tofld+'/'+Jupyter_Notes)
print(Jupyter_Notes, 'deleted in', Tofld)
shutil.copy(os.path.join(FromFld,Jupyter_Notes),os.path.join(Tofld,Jupyter_Notes))
print(Jupyter_Notes, 'copied in', Tofld)
else:
shutil.copy(os.path.join(FromFld,Jupyter_Notes),os.path.join(Tofld,Jupyter_Notes))
print(Jupyter_Notes, 'copied in', Tofld)
except Exception as e:
print('NB not moved')
Data_Analysis_Notes.html written in C:\Users\Gamaliel\Documents\G\ADD\IBM_DS\IBM_DS_Jupyter_Tasks\Python4DataScience\ Data_Analysis_Notes.ipynb copied in C:\Users\Gamaliel\Documents\G\ADD\IBM_DS\IBM_DS_Jupyter_Tasks\Python4DataScience\